image type
CLIPPan: Adapting CLIP as A Supervisor for Unsupervised Pansharpening
Jian, Lihua, Liu, Jiabo, Wu, Shaowu, Chen, Lihui
Despite remarkable advancements in supervised pansharpening neural networks, these methods face resolution-related domain adaptation challenges due to the intrinsic disparity between simulated reduced-resolution training data and real-world full-resolution scenarios. To bridge this gap, we propose an unsupervised pansharpening framework, CLIPPan, that enables model training directly at full resolution by using CLIP, a visual-language model, as a supervisor. However, directly applying CLIP to supervise pansharpening remains challenging due to its inherent bias toward natural images and limited understanding of pansharpening tasks. Therefore, we first introduce a lightweight fine-tuning pipeline that adapts CLIP to recognize low-resolution multispectral, panchromatic, and high-resolution multispectral images, as well as to understand the pansharpening process. Then, building on the adapted CLIP, we formulate a novel loss integrating semantic language constraints, which aligns image-level fusion transitions with protocol-aligned textual prompts (e.g., Wald's or Khan's descriptions), thus enabling CLIPPan to use language as a powerful supervisory signal and guide fusion learning without ground truth. Extensive experiments demonstrate that CLIPPan consistently improves spectral and spatial fidelity across various pansharpening backbones on real-world datasets, setting a new state of the art for unsupervised full-resolution pansharpening.
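The abstract does not give the formula for the language-constrained loss. As a rough illustration of what "aligning an image-level transition with a textual one" can mean, here is a minimal sketch of a generic directional loss in embedding space. This is an assumption for illustration, not the authors' formulation: the function names are hypothetical, and plain Python lists stand in for the outputs of CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def directional_loss(img_src, img_fused, txt_src, txt_fused):
    """Hypothetical directional loss (an assumption, not the paper's loss).

    img_src / img_fused: stand-ins for image embeddings of the input
    (e.g. the low-resolution multispectral image) and of the fused output.
    txt_src / txt_fused: stand-ins for text embeddings of prompts that
    describe the input and the desired output.
    The loss is 0 when the image-space change points in the same direction
    as the text-space change, and 2 when it points the opposite way.
    """
    delta_img = [b - a for a, b in zip(img_src, img_fused)]
    delta_txt = [b - a for a, b in zip(txt_src, txt_fused)]
    return 1.0 - cosine(delta_img, delta_txt)


# Toy 2-D embeddings: the image change is parallel to the text change.
aligned = directional_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [2.0, 0.0])
# Toy 2-D embeddings: the image change is opposite to the text change.
opposed = directional_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0])
```

In a real training loop the embeddings would come from the adapted CLIP encoders and the loss would be computed on tensors so that gradients reach the fusion network; the sketch only shows the geometry of the comparison.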
- Asia > China > Chongqing Province > Chongqing (0.04)
- Asia > China > Hubei Province > Wuhan (0.04)
- Asia > China > Henan Province > Zhengzhou (0.04)
SEVIR: A Storm Event Imagery Dataset for Deep Learning Applications in Radar and Satellite Meteorology
Mark S. Veillette
Modern deep learning approaches have shown promising results in meteorological applications like precipitation nowcasting, synthetic radar generation, front detection and several others. In order to effectively train and validate these complex algorithms, large and diverse datasets containing high-resolution imagery are required.
- North America > United States > Massachusetts > Middlesex County > Lexington (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- South America (0.04)
- (5 more...)
- Government > Regional Government > North America Government > United States Government (0.68)
- Transportation > Air (0.68)
- Transportation > Infrastructure & Services (0.46)
KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
Hwang, Taebaek, Kim, Minseo, Lee, Gisang, Kim, Seonuk, Eun, Hyunjun
Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at https://github.com/tabtoyou/KRETA.
- Asia > South Korea > Seoul > Seoul (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
Predicting household socioeconomic position in Mozambique using satellite and household imagery
Milà, Carles, Matsena, Teodimiro, Jamisse, Edgar, Nunes, Jovito, Bassat, Quique, Petrone, Paula, Sicuri, Elisa, Sacoor, Charfudin, Tonne, Cathryn
Many studies have predicted SocioEconomic Position (SEP) for aggregated spatial units such as villages using satellite data, but SEP prediction at the household level and other sources of imagery have not yet been explored. We assembled a dataset of 975 households in a semi-rural district in southern Mozambique, consisting of self-reported asset, expenditure, and income SEP data, as well as multimodal imagery including satellite images and a ground-based photograph survey of 11 household elements. We fine-tuned a convolutional neural network to extract feature vectors from the images, which we then used in regression analyses to model household SEP using different sets of image types. The best prediction performance was found when modeling asset-based SEP using random forest models with all image types, while the performance for expenditure- and income-based SEP was lower. Using SHAP, we observed clear differences between the images with the largest positive and negative effects, and identified the most relevant household elements in the predictions. Finally, we fitted an additional reduced model using only the identified relevant household elements, which performed only slightly worse than models using all images. Our results show how ground-based household photographs make it possible to zoom in from area-level to individual household prediction while minimizing the data collection effort by using explainable machine learning. The developed workflow can potentially be integrated into routine household surveys, where the collected household imagery could be used for other purposes, such as refined asset characterization and environmental exposure assessment.
- North America > United States > New York > New York County > New York City (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- (11 more...)
- Materials (1.00)
- Health & Medicine > Epidemiology (0.69)
- Government > Regional Government (0.68)
- (2 more...)
An Evaluation of GPT-4V and Gemini in Online VQA
A comprehensive evaluation is critical to assess the capabilities of large multimodal models (LMMs). In this study, we evaluate the state-of-the-art LMMs, namely GPT-4V and Gemini, utilizing the VQAonline dataset. VQAonline is an end-to-end authentic VQA dataset sourced from a diverse range of everyday users. Compared with previous benchmarks, VQAonline aligns well with real-world tasks. It enables us to effectively evaluate the generality of an LMM, and facilitates a direct comparison with human performance. To comprehensively evaluate GPT-4V and Gemini, we generate seven types of metadata for around 2,000 visual questions, such as image type and the required image processing capabilities. Leveraging this array of metadata, we analyze the zero-shot performance of GPT-4V and Gemini, and identify the most challenging questions for both models.
- North America > United States > North Dakota > Burke County (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
- Africa > South Africa (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Vision (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
Do humans and Convolutional Neural Networks attend to similar areas during scene classification: Effects of task and image type
Müller, Romy, Dürschmidt, Marcel, Ullrich, Julian, Knoll, Carsten, Weber, Sascha, Seitz, Steffen
Deep Learning models like Convolutional Neural Networks (CNN) are powerful image classifiers, but what factors determine whether they attend to similar image areas as humans do? While previous studies have focused on technological factors, little is known about the role of factors that affect human attention. In the present study, we investigated how the tasks used to elicit human attention maps interact with image characteristics in modulating the similarity between humans and CNN. We varied the intentionality of human tasks, ranging from spontaneous gaze during categorization through intentional gaze-pointing to manual area selection. Moreover, we varied the type of image to be categorized, using either singular, salient objects, indoor scenes consisting of object arrangements, or landscapes without distinct objects defining the category. The human attention maps generated in this way were compared to the CNN attention maps revealed by explainable artificial intelligence (Grad-CAM). The influence of human tasks strongly depended on image type: For objects, human manual selection produced maps that were most similar to CNN, while the specific eye movement task had little impact. For indoor scenes, spontaneous gaze produced the least similarity, while for landscapes, similarity was equally low across all human tasks. To better understand these results, we also compared the different human attention maps to each other. Our results highlight the importance of taking human factors into account when comparing the attention of humans and CNN.
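The abstract does not say which similarity measure was used to compare human and CNN attention maps. As one common option, the sketch below computes the Pearson correlation between two equally sized 2-D maps. The choice of metric and the function names are assumptions made for illustration, and nested lists stand in for real heatmaps such as a gaze density map and a Grad-CAM output.

```python
import math


def flatten(attention_map):
    """Turn a 2-D list of numbers into a flat list, row by row."""
    return [value for row in attention_map for value in row]


def map_correlation(map_a, map_b):
    """Pearson correlation between two attention maps of identical shape.

    Returns a value in [-1, 1]; returns 0.0 if either map is constant,
    because the correlation is undefined in that case.
    """
    a, b = flatten(map_a), flatten(map_b)
    if len(a) != len(b):
        raise ValueError("attention maps must have the same number of cells")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Toy 2x2 maps: the second is a scaled copy of the first, the third is reversed.
human_map = [[0.0, 1.0], [2.0, 3.0]]
cnn_map_similar = [[0.0, 2.0], [4.0, 6.0]]
cnn_map_reversed = [[3.0, 2.0], [1.0, 0.0]]

similar_score = map_correlation(human_map, cnn_map_similar)
reversed_score = map_correlation(human_map, cnn_map_reversed)
```

Correlation ignores overall scale, which is convenient when the two maps come from different sources with different value ranges; other measures (for example overlap of thresholded regions) would emphasize different aspects of agreement.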
- South America > Suriname > North Atlantic Ocean (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Texas > Loving County (0.04)
- (12 more...)
A Study on Improving Realism of Synthetic Data for Machine Learning
Shen, Tingwei, Zhao, Ganning, You, Suya
Synthetic-to-real data translation using generative adversarial learning has achieved significant success in improving synthetic data. Yet, few studies focus on in-depth evaluation and comparison of adversarial training on general-purpose synthetic data for machine learning. This work aims to train and evaluate a synthetic-to-real generative model that transforms synthetic renderings into more realistic styles on general-purpose datasets conditioned with unlabeled real-world data. Extensive performance evaluation and comparison have been conducted through qualitative and quantitative metrics and a defined downstream perception task.
- North America > United States > California > Los Angeles County > Los Angeles (0.29)
- North America > United States > California > Alameda County > Berkeley (0.14)
Conditional Progressive Generative Adversarial Network for satellite image generation
Cardoso, Renato, Vallecorsa, Sofia, Nemni, Edoardo
Image generation and image completion are rapidly evolving fields, thanks to machine learning algorithms that are able to realistically replace missing pixels. However, generating large, high-resolution images with a high level of detail presents important computational challenges. In this work, we formulate the image generation task as completion of an image where one out of three corners is missing. We then extend this approach to iteratively build larger images with the same level of detail. Our goal is to obtain a scalable methodology to generate high-resolution samples typically found in satellite imagery data sets. We introduce a conditional progressive Generative Adversarial Network (GAN) that generates the missing tile in an image, using as input three initial adjacent tiles encoded in a latent vector by a Wasserstein auto-encoder. We focus on a set of images used by the United Nations Satellite Centre (UNOSAT) to train flood detection tools, and validate the quality of synthetic images in a realistic setup.
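The abstract describes generating a missing tile from three adjacent tiles and repeating this to grow larger images, but it does not spell out the traversal order. The sketch below shows one plausible way such an iteration could be organized. It rests on assumptions that are not in the source: the first row and first column of tiles are taken as given, each remaining tile is produced from its top-left, top, and left neighbours, and a trivial averaging function stands in for the trained generator. Tiles are single numbers here purely to keep the example small.

```python
def fill_grid(seed_row, seed_col, generate):
    """Fill a grid of tiles from a seeded first row and first column.

    seed_row: tiles for row 0 (one per column).
    seed_col: tiles for column 0 (one per row); seed_col[0] is the corner tile.
    generate: callable(top_left, top, left) -> new tile. In the paper's
              setting this role would be played by a trained model; here it
              is any function supplied by the caller.
    """
    height, width = len(seed_col), len(seed_row)
    grid = [[None] * width for _ in range(height)]
    grid[0] = list(seed_row)
    for r in range(height):
        grid[r][0] = seed_col[r]
    # Row-major order guarantees the three neighbours already exist.
    for r in range(1, height):
        for c in range(1, width):
            grid[r][c] = generate(grid[r - 1][c - 1], grid[r - 1][c], grid[r][c - 1])
    return grid


def mean_tile(top_left, top, left):
    """Placeholder generator (hypothetical): average of the three neighbours."""
    return (top_left + top + left) / 3.0


# A 2x3 grid seeded with constant tiles stays constant under averaging.
constant_grid = fill_grid([1.0, 1.0, 1.0], [1.0, 1.0], mean_tile)
# With a summing generator, the single generated tile is 1 + 1 + 1.
summed_grid = fill_grid([1, 1], [1, 1], lambda tl, t, l: tl + t + l)
```

Because each step only looks at a fixed-size neighbourhood, the cost per generated tile stays constant as the image grows, which is the scalability property the abstract is aiming for.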
Run image classification with Amazon SageMaker JumpStart
Last year, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). JumpStart hosts 196 computer vision models, 64 natural language processing (NLP) models, 18 pre-built end-to-end solutions, and 19 example notebooks to help you get started with using SageMaker. These models can be quickly deployed and are pre-trained open-source models from PyTorch Hub and TensorFlow Hub. These models solve common ML tasks such as image classification, object detection, text classification, sentence pair classification, and question answering. The example notebooks show you how to use the 17 SageMaker built-in algorithms and other features of SageMaker.